This file covers advanced data exploration topics such as data transformation and visualizing spatial data. We will also cover the ggmap and leaflet packages.
Data - We always start by specifying the data source of the graph.
Aes - The aesthetic mapping that determines the axes and the positions of the data points in the plot.
It also allows us to specify the form of encodings, such as color, size, shape and so on.
Geom - Geometric objects in which we would like our data to be depicted. Such as points, lines, bars, etc.
Coord(“Co-ord”) - The coordinate system that the visualization is based on, such as Cartesian or polar.
Scale - The specific scale needed to represent multiple values or a range.
Facet - To create subplots based on specific data dimensions.
Stat - Statistical measures to apply to the data before it is plotted.
Position - How your geometric objects are positioned.
To load the ‘custdata’ dataset:
custdata <- read.table('custdata.tsv', header=T, sep='\t')
Here is the full grammar-of-graphics specification we would use in ggplot2 to create a density plot of the income of all customers in the ‘custdata’ dataset.
library(ggplot2)
ggplot(custdata, aes(x=income)) +
geom_density(stat = "density", position = "identity") +
coord_cartesian() +
scale_x_continuous() +
scale_y_continuous()
You can see that firstly, we have ‘custdata’ as Data in the ggplot() function.
Then, we have a data mapping specifying that income value will be mapped to the x-axis in Aes.
The Geom ‘geom_density’ indicates that we are using a density plot.
The Stat applied to the data for this plot is ‘density’.
The Position of the geometric object is ‘identity’.
We adopted the Cartesian Coord system and both the x and y-axis use continuous Scale.
However, this is just an example to illustrate the various grammar-of-graphics components in ggplot. In reality, many of these options are already the default settings for density plots, so a much simpler command can achieve the same result.
ggplot(custdata, aes(x=income)) +
geom_density()
Here we will focus on the Position parameter in ggplot2.
There are five types of positioning:
Identity - default of most geoms.
Jitter - default of geom_jitter.
Dodge - default of geom_boxplot.
Stack - default of geom_bar, geom_histogram, and geom_area.
Fill - useful for geom_bar, geom_histogram, and geom_area.
Looking back at our ‘custdata’ dataset, how would we compare the number of people who buy health insurance across different marital statuses? We would make use of ‘geom_bar’.
ggplot(custdata) + geom_bar(aes(x=marital.stat, fill=health.ins))
As Stack is the default Position parameter for geom_bar, the bars for the different health insurance statuses are stacked vertically atop one another.
Here we see that ‘Married’ customers make up the largest segment of those insured, while ‘Widowed’ customers make up the smallest uninsured segment.
Sometimes, comparing numbers across Stack positioning might be difficult. Instead, we can use the Dodge positioning to put bars side by side.
ggplot(custdata) +
geom_bar(aes(x=marital.stat, fill=health.ins), position='dodge')
This allows us to compare numbers in-category and also across different categories.
Can we then conclude that, because ‘Married’ customers have the highest absolute numbers, they are the best customers for the insurance company due to their large segment size? It may just be that most customers tend to be married, and the absolute numbers may, in fact, be misleading.
Using Fill, we can get a clear view of the relative proportions within each category.
ggplot(custdata) +
geom_bar(aes(x=marital.stat, fill=health.ins), position='fill')
From the plot, we can see that ‘Widowed’ customers clearly have the highest tendency to buy insurance, rather than those that are ‘Married’. ‘Never Married’ customers have the least tendency to buy health insurance by far.
We transform data for modeling needs, better visualization, or easier interpretation.
data <- read.csv("area&population-raw.csv")
Here, we show a scatter plot of the area against the population for all countries.
ggplot(data, aes(x=Area..in.sq.mi., y=Population)) +
geom_point(colour='blue')
There is no discernible relationship between Population and Area in this plot. This is a poor visualization, as most points are crowded into the bottom left corner, apart from a few outliers.
Let’s develop a better understanding of our data, maybe that will shed some light on how to improve our plot. First, we can plot a density plot of the area.
ggplot(data, aes(x=Area..in.sq.mi.)) +
geom_density()
The distribution is heavily right-skewed rather than normal. It is, in fact, a typical log-normal distribution, which means its logarithm might follow a normal distribution.
ggplot(data, aes(x=Population)) +
geom_density()
Notice that the population is also log-normally distributed. We should apply a logarithm transformation to both area and population.
data$area_log <- log10(data$Area..in.sq.mi.)
data$population_log <- log10(data$Population)
A scatter plot of ‘area_log’ against ‘population_log’.
ggplot(data, aes(x=area_log, y=population_log)) +
geom_point(colour='blue') +
geom_smooth(method = 'lm', color='red')
Notice how every point is shown clearly, and it also shows a clear linear relationship between ‘area_log’ and ‘population_log’. A simple linear regression could build a strong linear model between the two variables. It is also easy for us to interpret the relationship between area and population as well.
Suppose that the linear relationship between ‘population_log’ and ‘area_log’ is:
population_log = a x area_log + c
For country 1, we have:
Country 1: population_log1 = a x area_log1 + c
Suppose that:
Country 2: area2 = n x area1
If we take the logarithm (base 10) of this equation:
area_log2 = area_log1 + log(n)
With our first linear equation we can derive ‘population_log2’:
population_log2 = a x area_log2 + c
                = a x (area_log1 + log(n)) + c
                = (a x area_log1 + c) + a x log(n)
                = population_log1 + a x log(n)
Raising 10 to the power of both sides, we have:
population2 = 10^(a x log(n)) x population1 = n^a x population1
Note that the exponent a is the slope of the fitted line and is independent of n. This means that if the area of a country were to increase by n times, its population would increase by a factor of n^a.
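As a quick numerical check of this relationship, we can plug in made-up coefficients (a and c below are arbitrary, not fitted values) and confirm that scaling the area by n scales the population by exactly n^a:

```r
# Hypothetical coefficients for population_log = a * area_log + c
a <- 0.7
c <- 2.0
n <- 4          # country 2 has 4 times the area of country 1
area1 <- 50000

population1 <- 10^(a * log10(area1) + c)      # country 1
population2 <- 10^(a * log10(n * area1) + c)  # country 2

# The ratio equals n^a, independent of the areas themselves
population2 / population1
n^a
```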
Sometimes, we need to break numerical values into categories and study data behavior in these categories. For example, we might classify people by income group, classify products by sales volume or classify days by rainfall amount, etc.
Often, numerical values are broken down into intervals of equal length or divided by convenient numbers. However, it can be risky. It might not be fair to classify people by 0-2 years old vs 20-22 years old. Babies’ behavior changes significantly from year to year, while youth’s behavior doesn’t change so dramatically across the same period.
It may instead be more reasonable to break down the age group of babies further into 0-1 month, 1-3 months, 3-6 months, etc.
Breaking numerical values at convenient numbers doesn’t have much statistical foundation either, except that the numbers look nice. We may break down annual incomes into 0-10,000, 10,000-50,000, etc., but the difference between people earning 9,999 and 10,001 is insignificant. Instead, data visualization may be a better way to decide how to classify numerical values.
ggplot(custdata, aes(x=income, y=as.numeric(health.ins))) +
geom_point(position=position_jitter(w=0.05, h=0.05)) +
scale_x_log10(breaks=c(100,1000,10000,100000)) +
geom_smooth() +
annotation_logticks(sides="bt")
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous x-axis
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 79 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
The plot shows customer income vs whether they buy health insurance. The blue smoothing curve in the middle of the figure is the best fit curve approximating the ratio of customers buying health insurance with respect to their income.
Notice that the curve is quite flat when income is below 20,000, but increases sharply after income exceeds 20,000 and becomes flatter again after income exceeds 100,000. This gives us a hint that we should break income into three groups, 0-20,000, 20,000-100,000, and >100,000 if we are interested in how income affects the decision to buy health insurance.
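One way to encode these three groups is with base R’s cut() function (the sample incomes and labels below are illustrative; only the break points come from the plot):

```r
# Bin incomes into the three groups suggested by the smoothing curve
income <- c(8000, 15000, 45000, 90000, 250000)   # made-up sample values
income.group <- cut(income,
                    breaks = c(0, 20000, 100000, Inf),
                    labels = c("0-20K", "20K-100K", ">100K"))
income.group
```

The same idea applies directly to custdata$income.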
It is important to understand your data before taking action. This is the foundation of descriptive analytics.
Absolute figures compared across two categories might not make much sense before normalization: an income of SGD$2,000 in Singapore affords a vastly different lifestyle from SGD$2,000 in Indonesia. It is not just the exchange rate and the cost of living, but also the difference in wealth between the two nations that makes such a direct comparison tricky.
We can first normalize the data in order to compare incomes across the two countries: using the median or average income in each nation as the benchmark, we generate a relative ratio between each person’s income and the benchmark of the country they live in.
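A minimal sketch of this idea in base R, with made-up incomes for two countries (ave() computes each country’s median to serve as the benchmark):

```r
# Made-up incomes for two countries
df <- data.frame(country = c("SG", "SG", "SG", "ID", "ID", "ID"),
                 income  = c(2000, 4000, 6000, 300, 600, 900))

# Per-country median income as the benchmark
benchmark <- ave(df$income, df$country, FUN = median)

# Relative ratio of each person's income against their country's benchmark
df$income.normalised <- df$income / benchmark
df$income.normalised
```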
Looking back at our ‘custdata’ dataset, we can see that customers in the dataset come from different states and it would be unfair to directly compare their incomes. The income levels in different states vary a lot.
Here is the script from ‘GetMedianIncome.R’.
#This script will obtain USA state median incomes
library(XML)
library(RCurl)
## Loading required package: bitops
theurl<-"https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_income"
urldata <- getURL(theurl)
data<-readHTMLTable(urldata,stringsAsFactors=FALSE)
median.income <- as.data.frame(data[4])
colnames(median.income) <- c("Rank","State","Per.Capita.Income","Median.Household.Income","Median.Family.Income","Population","No.Households","No.Families")
median.income <- median.income[-c(1),]
#Convert Per.Capita.Income from text to numeric
median.income$Per.Capita.Income <- as.numeric(gsub(",","",gsub("\\$","",median.income$Per.Capita.Income)))
We now have the dataset ‘median.income’ that contains per capita income across the different states of the US.
We will now merge this with our ‘custdata’ dataset.
custdata <- merge(custdata, median.income, by.x = "state.of.res",
by.y = "State")
We will then normalize the income with this command.
custdata$income.normalised <- with(custdata,
income/Per.Capita.Income)
We can then use these normalized income values instead of the original income values.
Spatial data is now widely available. Studying patterns or behaviors on a geographical basis is becoming a common interest for business and data scientists.
‘ggmap’ allows us to visualize spatial data on a map. Remember to follow the pdf file from week 1 to set up Google and ggmap.
‘ggmap’ is a package developed on top of ggplot2 for visualizing spatial data, so it is fully compatible with ggplot2. A ggmap map can be viewed as simply another layer in ggplot2, which is advantageous because you can overlay all other ggplot2 plots on ggmap plots.
To plot a map of a country, state or city, you can obtain the map just by specifying its respective name.
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
m <- get_map("Singapore", source="google")
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Singapore&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx-fKwOanMfOTME
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Singapore&key=xxx-fKwOanMfOTME
ggmap(m)
m <- get_map("Washington", source="google")
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Washington&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx-fKwOanMfOTME
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Washington&key=xxx-fKwOanMfOTME
ggmap(m)
We can also use the zoom parameter to zoom in or out.
m <- get_map("Singapore", zoom=11, source="google")
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Singapore&zoom=11&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx-fKwOanMfOTME
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Singapore&key=xxx-fKwOanMfOTME
ggmap(m)
There are also several maptypes to choose from, such as satellite, hybrid, toner, watercolor, terrain-background, or toner-lite.
m <- get_map("Singapore", zoom=11, maptype="satellite",
source="google")
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Singapore&zoom=11&size=640x640&scale=2&maptype=satellite&language=en-EN&key=xxx-fKwOanMfOTME
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Singapore&key=xxx-fKwOanMfOTME
ggmap(m)
There are also four different sources of a map: google, osm, stamen, and cloudmade. Only stamen is open, with the rest requiring API keys.
m <- get_map("Singapore", zoom=11, source="stamen")
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Singapore&zoom=11&size=640x640&scale=2&maptype=terrain&key=xxx-fKwOanMfOTME
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Singapore&key=xxx-fKwOanMfOTME
## Source : http://tile.stamen.com/terrain/11/1613/1015.png
## Source : http://tile.stamen.com/terrain/11/1614/1015.png
## Source : http://tile.stamen.com/terrain/11/1615/1015.png
## Source : http://tile.stamen.com/terrain/11/1613/1016.png
## Source : http://tile.stamen.com/terrain/11/1614/1016.png
## Source : http://tile.stamen.com/terrain/11/1615/1016.png
## Source : http://tile.stamen.com/terrain/11/1613/1017.png
## Source : http://tile.stamen.com/terrain/11/1614/1017.png
## Source : http://tile.stamen.com/terrain/11/1615/1017.png
ggmap(m)
You can see that stamen’s maps are not as nice as Google’s. However, they are a possible alternative if you don’t want to go through the trouble of setting up a Google API key.
Here we will use Pizza Hut location data to illustrate how to integrate ggmap with other ggplot2 functions.
pizzahut.location <- read.csv("PizzaHut.csv", header = TRUE,
colClasses = c("character",
"character",
"factor",
"character",
"numeric",
"numeric"))
This data contains the address, region, longitude and latitude of all Pizza Hut branches in Singapore.
m <- get_map("Singapore", zoom=11, source="google")
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Singapore&zoom=11&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx-fKwOanMfOTME
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Singapore&key=xxx-fKwOanMfOTME
Next, add a ggplot as the base layer to the map.
m1 <- ggmap(m, base_layer = ggplot(data = pizzahut.location,
aes(x=lon, y=lat)))
Finally, use the “geom_point” function to plot points at the locations of the Pizza Hut branches, with longitude mapped to the x-axis and latitude mapped to the y-axis. The points are further colored by the region they belong to.
m1 + geom_point(aes(color=Region))
You are also highly recommended to read about how to integrate heatmaps with ggmap.
Sometimes we may want to compare data by regions, not by specific locations. The ‘ggmap’ package alone is insufficient for this; instead, we will have to use the ‘raster’ and ‘rgdal’ packages.
Use the code below to install them, or use the Packages pane in RStudio.
install.packages("raster")
install.packages("rgdal")
The key function is getData() from ‘raster’, which will obtain geographic data for anywhere in the world. It currently supports five sources of data: GADM, countries, SRTM, alt, and worldclim.
library(rgdal)
## Loading required package: sp
## rgdal: version: 1.3-9, (SVN revision 794)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
## Path to GDAL shared files: C:/Users/Alex/Documents/R/win-library/3.5/rgdal/gdal
## GDAL binary built with GEOS: TRUE
## Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
## Path to PROJ.4 shared files: C:/Users/Alex/Documents/R/win-library/3.5/rgdal/proj
## Linking to sp version: 1.3-1
library(raster)
library(XML)
SG <- getData('GADM', country='SG', level =1)
SG
## class : SpatialPolygonsDataFrame
## features : 5
## extent : 103.6091, 104.0858, 1.16639, 1.471388 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## variables : 10
## names : GID_0, NAME_0, GID_1, NAME_1, VARNAME_1, NL_NAME_1, TYPE_1, ENGTYPE_1, CC_1, HASC_1
## min values : SGP, Singapore, SGP.1_1, Central, NA, NA, Region, Region, NA, NA
## max values : SGP, Singapore, SGP.5_1, West, NA, NA, Region, Region, NA, NA
This gets us the Singapore map data from GADM. Its class is “SpatialPolygonsDataFrame”, a special class wrapped around a data frame. Unlike with a normal data frame, we use the at sign ‘@’ instead of the dollar sign ‘$’ to access its slots.
The most important variables are ‘data’ and ‘polygons’.
SG@data
## GID_0 NAME_0 GID_1 NAME_1 VARNAME_1 NL_NAME_1 TYPE_1
## 260003 SGP Singapore SGP.1_1 Central <NA> <NA> Region
## 260001 SGP Singapore SGP.2_1 East <NA> <NA> Region
## 259999 SGP Singapore SGP.3_1 North <NA> <NA> Region
## 260000 SGP Singapore SGP.4_1 North-East <NA> <NA> Region
## 260002 SGP Singapore SGP.5_1 West <NA> <NA> Region
## ENGTYPE_1 CC_1 HASC_1
## 260003 Region <NA> <NA>
## 260001 Region <NA> <NA>
## 259999 Region <NA> <NA>
## 260000 Region <NA> <NA>
## 260002 Region <NA> <NA>
class(SG@data)
## [1] "data.frame"
This shows us that ‘SG@data’ itself is a data frame. ‘NAME_0’ stores the name of the country while ‘NAME_1’ stores the names of the five regions in Singapore.
GADM only has data up to this level for Singapore. However, for bigger countries, GADM has data up to level 2, where level 1 gives the states and level 2 gives the counties.
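For example, for the United States (a hypothetical illustration; both calls download data from GADM, and the level-2 file is large):

```r
library(raster)
# Level 1: US states; level 2: US counties
US.states   <- getData('GADM', country = 'USA', level = 1)
US.counties <- getData('GADM', country = 'USA', level = 2)
```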
SG@data is the outer layer through which the SG object communicates with other datasets. For example, suppose we create a data frame storing some numbers for each region in Singapore.
data_1 <- data.frame(Region=c('Central', 'East', 'North',
'North-East', 'West'),
value=c(3,4,1,7,10))
data_1
## Region value
## 1 Central 3
## 2 East 4
## 3 North 1
## 4 North-East 7
## 5 West 10
Then we want to merge SG@data with data_1.
SG@data <- merge(SG@data, data_1, by.x='NAME_1', by.y='Region')
SG@data
## NAME_1 GID_0 NAME_0 GID_1 VARNAME_1 NL_NAME_1 TYPE_1 ENGTYPE_1
## 1 Central SGP Singapore SGP.1_1 <NA> <NA> Region Region
## 2 East SGP Singapore SGP.2_1 <NA> <NA> Region Region
## 3 North SGP Singapore SGP.3_1 <NA> <NA> Region Region
## 4 North-East SGP Singapore SGP.4_1 <NA> <NA> Region Region
## 5 West SGP Singapore SGP.5_1 <NA> <NA> Region Region
## CC_1 HASC_1 value
## 1 <NA> <NA> 3
## 2 <NA> <NA> 4
## 3 <NA> <NA> 1
## 4 <NA> <NA> 7
## 5 <NA> <NA> 10
The values from data_1 are now added to SG@data by its respective regions.
Each region in Singapore will later be plotted as a polygon based on data stored in SG@polygons.
class(SG@polygons)
SG@polygons[1]
It shows a list of location data corresponding to the corners of the polygons of region 1. (In order not to needlessly lengthen this file, the output is not shown here; you can copy and run this yourself.)
This is just for your information; we will never touch this data directly.
spplot(SG, 'value')
It will plot the Singapore map with the given polygon data and color the five regions differently based on the values stored in the ‘value’ column of SG@data, which originally came from “data_1”.
In summary, to generate map view by borders, first, download map data from the right source. Next, link your data to the map data. Lastly, plot the map.
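Putting the three steps together, the whole workflow from this section reads:

```r
library(raster)

# 1. Download map data from the right source (GADM here)
SG <- getData('GADM', country = 'SG', level = 1)

# 2. Link your data to the map data
data_1 <- data.frame(Region = c('Central', 'East', 'North',
                                'North-East', 'West'),
                     value  = c(3, 4, 1, 7, 10))
SG@data <- merge(SG@data, data_1, by.x = 'NAME_1', by.y = 'Region')

# 3. Plot the map
spplot(SG, 'value')
```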
A great alternative to ggmap is leaflet. JavaScript is THE programming language of the web, and Leaflet is one of the most popular open-source JavaScript libraries for interactive maps.
The leaflet package in R lets you integrate and control the JavaScript Leaflet library. With a few lines of code, you can make use of its powerful features.
Refer back to the original work if needed.
Say that we want to show Sentosa Island on a map. Think of an address on the island, such as Sentosa Cove. Visit this website and type in “Sentosa Cove Singapore”. Click “Find” and you will get latitude of 1.239660, longitude of 103.835381.
First, we need to create a leaflet widget.
library(leaflet)
library(tidyverse)
## -- Attaching packages ------------------------------- tidyverse 1.2.1 --
## v tibble 2.0.1 v purrr 0.2.5
## v tidyr 0.8.2 v dplyr 0.7.8
## v readr 1.2.1 v stringr 1.3.1
## v tibble 2.0.1 v forcats 0.3.0
## -- Conflicts ---------------------------------- tidyverse_conflicts() --
## x tidyr::complete() masks RCurl::complete()
## x tidyr::extract() masks raster::extract()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::select() masks raster::select()
m <- leaflet()
Then, we add a specific map layer to it.
m <- addTiles(m)
m
It is basically an empty world map, waiting for a specific instruction from us. Leaflet provides many functions for us to add all sorts of layers on top of the base layer.
One of the most popular functions is addMarkers.
m <-addMarkers(m, lat = 1.239660, lng = 103.835381,
popup = "Sentosa Cove")
m
This produces a map centered at Sentosa Cove. The map is interactive: clicking on the marker shows the popup message we set in the function, and we can also zoom in or out on the map.
The above steps can be simplified into one line of code.
m <- leaflet() %>% addTiles() %>%
addMarkers(lat = 1.239660, lng = 103.835381,
popup = "Sentosa Cove")
m
If our objective is to show tourist attractions in Singapore with Sentosa Island as just one of the many spots, then what we have now is not ideal.
Instead, we can pre-set the zoom level so as to show the whole of Singapore by default. This can be done by adding one more function, setView, to the code.
m <- leaflet() %>% setView(lat = 1.239660, lng = 103.835381,
zoom = 11) %>%
addTiles() %>% addMarkers(lat = 1.239660, lng = 103.835381,
popup = "Sentosa Cove")
m
What about trying to plot a map showing Disneyland in Orlando US?
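As a sketch of one possible answer to this exercise (the coordinates below are approximate values for the Walt Disney World area near Orlando, introduced here as an assumption and worth verifying yourself):

```r
library(leaflet)
# Approximate coordinates for Walt Disney World, Orlando (assumed values)
leaflet() %>%
  setView(lat = 28.3852, lng = -81.5639, zoom = 11) %>%
  addTiles() %>%
  addMarkers(lat = 28.3852, lng = -81.5639,
             popup = "Walt Disney World, Orlando")
```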
Singapore has many tourist attractions other than Sentosa Island, how do we add all of them to the map?
First, we need to create a data frame that contains the longitude, latitude, and names of all the tourist attractions.
df <- data.frame(latitude = c(1.25011, 1.28544, 1.318707,
1.282375, 1.4043, 1.4022,
1.4029, 1.2868, 1.3332, 1.2893),
longitude = c(103.83093, 103.859590, 103.706442,
103.864273, 103.7930, 103.7881,
103.7917, 103.8545, 103.7362,
103.8631),
name = c("Sentosa Island", "Marina Bay Sands",
"Jurong Bird Park",
"Gardens By the Bay", "Singapore Zoo",
"Night Safari", "River Safari",
"Merlion Park", "Science Center",
"Singapore Flyer"))
Now, we can use what we did for Sentosa Island previously but with a small change.
Supply df as the data source and then specify the mapping of longitude, latitude, and popup message to the columns in df.
leaflet() %>% addTiles() %>%
addMarkers(data = df, lng = ~longitude, lat = ~latitude,
popup = ~name)
Notice that many tourist attractions are concentrated around certain areas. It is very likely that a tourist will visit them all in one go. So, we should instead show them as clusters rather than individual spots.
We will use addCircleMarkers() for this goal.
leaflet() %>% addTiles() %>%
addCircleMarkers(data = df, lng = ~longitude, lat = ~latitude,
radius = 5,
clusterOptions = markerClusterOptions())
The places close by are now shown as just a circle with the number of places on it. Notice that the numbers will change as you zoom in and out.
Similar to ggplot2, leaflet allows us to overlay geometric objects to form more complex views. More importantly, leaflet provides interactive controls that allow users to choose what layers to show on the map.
We will aim to have circle markers for each branch, and for them to have different colors depending on which region they belong to.
Let’s take a look at the pizza hut data again.
pizzahut.location
## Address Zipcode Region Location lon
## 1 Ang Mo Kio 560715 North Singapore 560715 103.8459
## 2 Hougang Mall 538766 North Singapore 538766 103.8939
## 3 Nex 556083 North Singapore 556083 103.8718
## 4 Thompson Plaza 574408 North Singapore 574408 103.8309
## 5 Tao Payoh 310190 North Singapore 310190 103.8494
## 6 North Point Shopping Centre 769098 North Singapore 769098 103.8356
## 7 Causeway Point 738099 North Singapore 738099 103.7862
## 8 The Seletar Mall 797653 North Singapore 797653 103.7509
## 9 Sun Plaza 757713 North Singapore 757713 103.8192
## 10 Waterway Point 828761 North Singapore 828761 103.7509
## 11 Bedok 460215 East Singapore 460215 103.9335
## 12 Tampines Mall 529510 East Singapore 529510 103.9452
## 13 Bedok Mall 467360 East Singapore 467360 103.7509
## 14 OneKM 437157 East Singapore 437157 103.7509
## 15 East Point Mall 528833 East Singapore 528833 103.9530
## 16 Harbourfront Centre 99253 Central Singapore 99253 103.7509
## 17 Lucky Plaza 238863 Central Singapore 238863 103.8338
## 18 Marina Square 39594 Central Singapore 39594 103.7509
## 19 Plaza Singapura 238839 Central Singapore 238839 103.8454
## 20 City Square Mall 208539 Central Singapore 208539 103.8566
## 21 Bukit Panjang Plaza 677743 West Singapore 677743 103.7642
## 22 Bukit Timah Plaza 588996 West Singapore 588996 103.7790
## 23 Chao Chu Kang 689812 West Singapore 689812 103.7450
## 24 West Mall 658713 West Singapore 658713 103.7491
## 25 Jurong Point 648886 West Singapore 648886 103.7065
## 26 Westgate 608532 West Singapore 608532 103.7509
## lat
## 1 1.371011
## 2 1.372629
## 3 1.350644
## 4 1.354637
## 5 1.332654
## 6 1.429848
## 7 1.435855
## 8 1.345871
## 9 1.448417
## 10 1.345871
## 11 1.326057
## 12 1.352661
## 13 1.345871
## 14 1.345871
## 15 1.342806
## 16 1.345871
## 17 1.304360
## 18 1.345871
## 19 1.301016
## 20 1.311403
## 21 1.379962
## 22 1.339001
## 23 1.384848
## 24 1.349649
## 25 1.339874
## 26 1.345871
It contains the address, zip code, region, location, longitude, and latitude of all the Pizza Hut branches in Singapore.
Let’s first create a vector of the four regions used in the data.
region.list <- c("North", "East", "Central", "West")
Next, we will create a colorFactor object, which maps color codes to the regions in our data.
colorFactors <- colorFactor(c('red', 'green', 'blue', 'brown'),
domain = pizzahut.location$Region)
Then, we initiate the leaflet widget.
m <- leaflet() %>% addTiles()
Then, I will go through the four regions to add layers one by one. For each region, pizzahut.region contains only the data belonging to that region. Then, we add a layer with circle markers defined by this pizzahut.region. Note that I added two important parameters: one specifies the color code to be used for this region, and the other defines a group parameter for this layer.
for(i in 1:4)
{
pizzahut.region <- pizzahut.location[pizzahut.location$Region == region.list[i],]
m <- addCircleMarkers(m,
lng=pizzahut.region$lon,
lat=pizzahut.region$lat,
popup=pizzahut.region$Address,
radius=10,
stroke = FALSE,
fillOpacity = 1,
color = colorFactors(pizzahut.region$Region),
group = region.list[i]
)
}
Next, I am going to add different types of maps. The names given in the functions are provider names; I selected four of them. You can visit this website to see the full list of providers. Again, a group parameter is specified for each layer.
m <- addTiles(m,group="Default")
m <- addProviderTiles(m,"Esri.WorldImagery", group = "Esri")
m <- addProviderTiles(m,"Stamen.Toner", group = "Toner")
m <- addProviderTiles(m, "Stamen.TonerLite", group = "Toner Lite")
Finally, I will put all of them together as controls. We need to supply the names of groups defined in the previous functions here.
There are two types of groups. baseGroups form the basic layer of the map; only one of these groups is shown at a time, which is why their controls appear as radio buttons. overlayGroups can overlay each other, meaning they can co-exist, so they appear as checkboxes.
m <- addLayersControl(m, baseGroups = c("Default","Esri",
"Toner Lite","Toner"),
overlayGroups = region.list
)
m
Finally, you can save the result as an HTML page and embed it on your own website.
library(htmlwidgets)
saveWidget(m, file="pizzahutlocations.html")
In leaflet, addMarkers and addCircleMarkers place markers based on the locations of the data points. Sometimes, however, we may want to visualize data on a regional basis.
For example, we may want to compare annual rainfall over the five regions of Singapore: the denser a region is colored, the greater its rainfall. The coloring is not perfect here due to the poor quality of the region boundary data.
The following code shows how we can obtain the map data and integrate it with simple made-up data representing rainfall over regions. Go back to the Map View by Borders section for a recap on what these commands do.
library(raster)
data_1 <- data.frame(Region = c('Central','East','North','North-East','West'),
Value = c(3,4,1,7,10))
SG<-getData('GADM', country='SG', level=1)
SG@data <- merge(SG@data,data_1,by.x="NAME_1",by.y="Region")
popup <- paste0("<strong>Name: </strong>",SG$NAME_1)
Next, we create a palette with blue as the key color scheme, with color intensity proportional to the data values.
pal <- colorNumeric( palette = "Blues", domain = SG$Value)
We are now ready to produce the map.
#continuous palette color
m<-leaflet() %>% addTiles() %>% addPolygons(data=SG, weight = 2,
stroke = TRUE, smoothFactor = 0.1, fillOpacity = 0.8, color = ~pal(Value), popup=popup)
m
Sometimes, we may just want to differentiate the regions by different colors. It can be done by using a discrete color setting.
#random color
factpal <- colorFactor(topo.colors(5), SG$Value)
m<-leaflet() %>% addTiles() %>% addPolygons(data=SG, weight = 2,
stroke = FALSE, smoothFactor = 0.2, fillOpacity = 0.8, color = ~factpal(Value), popup=popup)
m
Again, you can save the result as an HTML page and embed it on your own website.
library(htmlwidgets)
saveWidget(m, file="attractions.html")
We can compare data over regions by polygon coloring, in which each region is colored uniformly. Alternatively, we can create a heatmap, where the density of the color is based on the density of events at each location.
Here is a simple method to create such a heatmap, developed based on the code shared on this website.
R has a built-in dataset that contains a series of earthquakes that occurred near Fiji.
Firstly, you need to install a library called “leaflet.extras”.
library(leaflet.extras)
#Some versions of RStudio have problems rendering images if the source of the map is on https
#So set this option to direct the view on the web browser
options("viewer" = function(url, ...) utils::browseURL(url))
Let’s first take a look at how the earthquakes are distributed, using the addMarkers() function.
#Use markers to mark all the earthquakes on the map
leaflet(quakes) %>% addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
addMarkers(lng = ~long , lat = ~lat)
This is a bit messy as the markers are overlapping with each other.
To generate a heatmap, just use the following command:
#Now create heat map that "heat" is based on the number of earthquakes
#around different locations.
m<-leaflet(quakes) %>% addProviderTiles(providers$CartoDB.DarkMatter) %>%
addWebGLHeatmap(lng=~long, lat=~lat, size = 60000)
m
To provide better contrast, we used a dark map background. You can also change the provider to see different effects. The size is measured in meters by default; above, each event has a physical radius of 60,000 meters.
We can also color the heatmap based on the earthquakes’ magnitudes instead of the number of occurrences. This can be done by supplying the magnitude as the intensity.
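A minimal sketch of this, assuming the intensity argument of addWebGLHeatmap() accepts a column mapping like the other aesthetics:

```r
library(leaflet)
library(leaflet.extras)
# Weight each location by earthquake magnitude instead of raw counts
leaflet(quakes) %>%
  addProviderTiles(providers$CartoDB.DarkMatter) %>%
  addWebGLHeatmap(lng = ~long, lat = ~lat,
                  intensity = ~mag, size = 60000)
```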
Again, you can save the result as an HTML page and embed it on your own website.
library(htmlwidgets)
saveWidget(m, file="heatmap.html")